Write an R function named explore that takes a data frame, a vector of bin sizes, and a correlation threshold as input parameters:
1) Plot a pair of blue histograms with a vertical red line at the mean (one using counts and the other density) for every numerical variable at each bin size specified in the bin sizes input parameter. You can plot individually or as a grid. If you choose to plot as a grid, there should be separate grids for each count-bin size combination and separate grids for each density-bin size combination. For example, 5 numeric variables and a vector of three bin sizes will generate 30 individual plots or a total of 6 grid plots (with each grid plot containing 5 subplots).
2) Plot a gray bar graph for every categorical and binary variable.
3) Calculate the r2 (r-square) value for every pair of numerical variables.
4) Return the following in an R list:
   a. A frequency table for every categorical and binary variable
   b. For numerical variables
      i. A summary statistics table for each numerical variable
      ii. A data frame that contains each pair of variable names and the associated r-square value.
      iii. A data frame that contains each pair of variable names and correlation coefficient (Pearson) for all coefficients whose absolute value is greater than the correlation threshold (do not repeat any pairs)
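The plot counts quoted in the example follow from simple multiplication: every numeric variable gets a count and a density histogram at every bin size. A minimal sketch of that arithmetic, where `count_plots` is a hypothetical helper name and not part of the assignment:

```r
# Hypothetical helper: how many plots part 1 asks for.
count_plots <- function(n_numeric, bin_sizes) {
  n_bins <- length(bin_sizes)
  list(
    individual = n_numeric * n_bins * 2,  # count + density for every variable and bin size
    grids      = n_bins * 2               # one count grid and one density grid per bin size
  )
}

counts <- count_plots(5, c(5, 20, 50))
counts$individual  # 30
counts$grids       # 6
```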
explore<-function(dataframe,binsize,cor_threshold){ #Define the explore function with input data dataframe, binsize and cor_threshold
require(grid) #load grid package for plots
require(ggplot2) #load ggplot2 package for plots
#Question 1: we first find the numeric variables and then use a for loop to draw the histograms with ggplot()
nums<-dataframe[sapply(dataframe,is.numeric)] #use sapply() to find all numeric columns and store them in nums
histlist<-list() #create a variable histlist to hold the current histogram, empty at first
for (i in 1:length(binsize)){ #go through binsize from the first number to the last
for (j in 1:(ncol(nums))){ #go through nums from the first column to the last
binw<-(max(nums[,j])-min(nums[,j]))/binsize[i] #calculate the bin width as the range of the variable divided by the requested number of bins
histlist<-ggplot(nums,aes(x=nums[,j]),environment=environment()) #use ggplot() to start the plot for each numerical variable, with the jth column as the x aesthetic, and pass the function environment so the local variables can be found
histlist<-histlist+geom_histogram(colour="blue",fill="blue",binwidth=binw)+labs(x=colnames(nums)[j])+geom_vline(xintercept=mean(nums[,j]),colour="red") #add a blue histogram with the calculated bin width and label the x axis (the y axis is labelled with counts automatically), then draw a vertical red line at the mean
print(histlist) #output the histogram using counts
print(histlist+aes(y=..density..)+labs(y="density")+geom_density()) #output the histogram using density, label the y axis, and overlay a density curve
}#finish the second loop
}#finish the first loop
#Question 2: We first find the factor and binary variables and put them into a data frame, then use a for loop to draw a bar graph for each one and store the graphs in a list.
factors<-dataframe[sapply(dataframe,is.factor)] #use sapply() to find all factor columns and store them in factors
binarys<-data.frame(matrix(ncol=0, nrow=nrow(dataframe))) #create an empty data frame for the binary columns with the same number of rows as dataframe
a=1 #create a counter a for the next column position in binarys
for (i in 1:ncol(dataframe)){ #go through dataframe from the first column to the last
if (sum(dataframe[,i]==1)+sum(dataframe[,i]==0)==nrow(dataframe)){ #use if() and sum() to check whether the column contains only 0s and 1s (TRUE and FALSE also pass, since they compare equal to 1 and 0)
binarys<-data.frame(binarys,dataframe[,i]) #append the binary column to binarys
names(binarys)[a]=colnames(dataframe)[i] #keep the original column name in the new data frame
a=a+1 #add 1 to a in order to go to the next column in binarys
}#finish if
} #finish for loop
fnb<-data.frame(factors,binarys) #create a data frame fnb holding the factor and binary columns
plotlist<-list() #create an empty list plotlist to hold all the bar graphs
for(i in 1:ncol(fnb)){ #use a for loop to go through all variables in fnb
plotlist[[i]]<-ggplot(fnb,aes_string(x=colnames(fnb)[i]),environment=environment())+geom_bar(colour="gray",fill="grey")+ggtitle(paste(colnames(fnb)[i],"distribution")) #store a gray bar graph of the ith column of fnb in plotlist[[i]], add a title, and pass the function environment
} #finish for loop
print(plotlist)#output plotlist
#Question 3: To calculate the r-square value between two variables, we fit a linear regression between them with lm(). We use a for loop to calculate r-square for every pair and store the values in a variable. We then create a data frame from the results, which answers 4bii at the same time.
Pair_of_variables<-c() #create a variable Pair_of_variables to hold the pairs of variable names
rsquared<-c() #create a variable rsquared to hold the r-square values
n=1 #create a counter n for the current position in Pair_of_variables and rsquared, starting at 1
for (i in 1:(ncol(nums)-1)){ #use a for loop to go from the first column to the second-to-last column in nums
for (j in (i+1):ncol(nums)){ #use a for loop to go from column i+1 to the last column in nums
Pair_of_variables[n]<-paste(colnames(nums)[i],"-",colnames(nums)[j],sep="") #use paste() to write the pair of variable names as a single string separated by a -, and store it in Pair_of_variables
rsquared[n]<-summary(lm(nums[,i] ~ nums[,j]))$r.squared #use summary() and lm() to get the r-square value between the two variables
n=n+1 #add 1 to n in order to go to the next position in Pair_of_variables and rsquared
} #finish second for loop
}#finish first for loop
#Question 4bii
newdata<-data.frame(Pair_of_variables, rsquared) #create a data frame newdata and put Pair_of_variables and rsquared into it
print(newdata) #output newdata
#Question 4a: we first create a variable to store the tables, then use a for loop to create a frequency table for every categorical and binary variable. Because mtcars doesn't have factors and
tablelist<-list() #create a variable tablelist for a list of tables, make it empty, we will use it to put all tables
for (i in 1:ncol(fnb)){ #using for loop to go through all categoricals and binary variables in dataframe
tablelist[[i]]<-as.data.frame(table((fnb)[,i])) #using table() to give the counts of ith column in fnb, convert it to a data frame and put into tablelist[[i]]
names(tablelist[[i]])[1]=colnames(fnb[i]) #using names() to retain the variable name in the corresponding column name
} #finish for loop
print(tablelist) #output tablelist
#Question 4bi: We use a for loop to create the summary statistics table for each numerical column.
sumtable<-list() #create an empty list sumtable to hold the summary tables
for (i in 1:ncol(nums)){ #for each numeric column in the data frame
sumtable[[i]] <- summary(nums[,i]) #store the summary table of the ith column as the ith element of sumtable
}#finish for loop
print(sumtable) #output sumtable
#Question 4biii: We use a for loop to build two variables, one for the pair of variable names and the other for the corresponding Pearson correlation coefficient. We use if() to select the pairs whose coefficient has an absolute value greater than cor_threshold, then put the two variables into a data frame.
Pairofvariables<-c() #create a variable Pairofvariables to hold the pairs of variable names
Pearson_cor_coeff<-c() #create a variable Pearson_cor_coeff to hold the corresponding Pearson correlation coefficients
n=1 #create a counter n for the current position in Pairofvariables and Pearson_cor_coeff, starting at 1
for (i in 1:(ncol(nums)-1)){ #use a for loop to go from the first column to the second-to-last column in nums
for (j in (i+1):ncol(nums)){ #use a for loop to go from column i+1 to the last column in nums, so that no pair is repeated
r<-cor(nums[ ,i],nums[ ,j],method="pearson") #use cor() to calculate the Pearson correlation coefficient between the ith and jth columns of nums
if(abs(r)>cor_threshold){ #use if() to check whether the absolute value of the coefficient is larger than the input value cor_threshold
Pairofvariables[n] <- paste(colnames(nums)[i],"-",colnames(nums)[j],sep="") #use paste() to write the pair of variable names as a single string separated by a -, and store it in Pairofvariables
Pearson_cor_coeff[n] <- r #store the coefficient in Pearson_cor_coeff
n=n+1 #add 1 to n in order to go to the next position in Pairofvariables and Pearson_cor_coeff
}#finish if
} #finish second for loop
}#finish first for loop
perdata<-data.frame(Pairofvariables, Pearson_cor_coeff) #create a data frame called perdata and put Pairofvariables and Pearson_cor_coeff into it
print(perdata) #output perdata
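invisible(list(frequency_tables=tablelist,summary_tables=sumtable,rsquared=newdata,correlations=perdata)) #Question 4: return the results in an R list, invisibly so they are not printed a second time
}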
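For a regression with a single predictor, the r-square value equals the squared Pearson correlation, so the pairwise r-square table can also be built without fitting one model per pair. The following is a minimal sketch under that assumption. `pairwise_rsq` is a hypothetical helper, not part of the function above, and the built-in `mtcars` data is used only for illustration.

```r
# Hypothetical helper: r-square for every unordered pair of numeric columns,
# using r^2 = cor(x, y)^2, which holds for one-predictor linear regression.
pairwise_rsq <- function(df) {
  nums <- df[sapply(df, is.numeric)]
  if (ncol(nums) < 2) {
    # Fewer than two numeric columns: there are no pairs to report
    return(data.frame(Pair_of_variables = character(0), rsquared = numeric(0)))
  }
  pairs <- combn(names(nums), 2)  # each column of `pairs` holds one pair of names
  data.frame(
    Pair_of_variables = paste(pairs[1, ], pairs[2, ], sep = "-"),
    rsquared = apply(pairs, 2, function(p) cor(nums[[p[1]]], nums[[p[2]]])^2),
    stringsAsFactors = FALSE
  )
}

rsq <- pairwise_rsq(mtcars)
nrow(rsq)  # one row per pair, i.e. choose(11, 2) rows for mtcars
# Cross-check the first pair against the lm() approach
all.equal(rsq$rsquared[1], summary(lm(mpg ~ cyl, data = mtcars))$r.squared)
```

Because `combn()` enumerates pairs in the same order as the nested loops, the rows line up with the loop-based table.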
require(ggplot2) #load ggplot2 package to get diamonds data frame
## Loading required package: ggplot2
require(datasets) #mtcars is in package datasets, just make sure we have mtcars
data(diamonds) #load diamonds data frame
ratioT=length(mtcars$vs[mtcars$vs==1])/length(mtcars$vs) #calculate the proportion of 1s in mtcars$vs and store it in ratioT
trail<-rbinom(nrow(diamonds),1,ratioT) #use rbinom() to draw one random 0 or 1 per row of diamonds, where ratioT is the probability of drawing a 1
logicalcol<-trail==1 #build the logical column: TRUE where trail equals 1, FALSE otherwise
newdiamonds<-data.frame(diamonds,logicalcol) #create a new data frame called newdiamonds and put diamonds and logicalcol into it
explore(newdiamonds, c(5,20,50), 0.25) #test explore() by using newdiamonds
## Loading required package: grid
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## Pair_of_variables rsquared
## 1 carat-depth 0.0007966119
## 2 carat-table 0.0329849332
## 3 carat-price 0.8493305264
## 4 carat-x 0.9508087510
## 5 carat-y 0.9057751441
## 6 carat-z 0.9089474974
## 7 depth-table 0.0874849338
## 8 depth-price 0.0001133672
## 9 depth-x 0.0006395460
## 10 depth-y 0.0008608750
## 11 depth-z 0.0090105434
## 12 table-price 0.0161630291
## 13 table-x 0.0381593881
## 14 table-y 0.0337677917
## 15 table-z 0.0227794699
## 16 price-x 0.7822255540
## 17 price-y 0.7489533305
## 18 price-z 0.7417506045
## 19 x-y 0.9500429745
## 20 x-z 0.9423978849
## 21 y-z 0.9063148836
## [[1]]
## cut Freq
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
##
## [[2]]
## color Freq
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
##
## [[3]]
## clarity Freq
## 1 I1 741
## 2 SI2 9194
## 3 SI1 13065
## 4 VS2 12258
## 5 VS1 8171
## 6 VVS2 5066
## 7 VVS1 3655
## 8 IF 1790
##
## [[4]]
## logicalcol Freq
## 1 FALSE 30487
## 2 TRUE 23453
##
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
##
## [[2]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 61.00 61.80 61.75 62.50 79.00
##
## [[3]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 56.00 57.00 57.46 59.00 95.00
##
## [[4]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18820
##
## [[5]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.710 5.700 5.731 6.540 10.740
##
## [[6]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.720 5.710 5.735 6.540 58.900
##
## [[7]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.910 3.530 3.539 4.040 31.800
##
## Pairofvariables Pearson_cor_coeff
## 1 carat-price 0.9215913
## 2 carat-x 0.9750942
## 3 carat-y 0.9517222
## 4 carat-z 0.9533874
## 5 price-x 0.8844352
## 6 price-y 0.8654209
## 7 price-z 0.8612494
## 8 x-y 0.9747015
## 9 x-z 0.9707718
## 10 y-z 0.9520057
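Part 4.b.iii of the assignment asks for pairs whose absolute correlation exceeds the threshold, with no pair repeated. One way to express both conditions at once is to compute the full correlation matrix and keep only its upper triangle. This is a minimal sketch, not the method used in explore(): `strong_pairs` is a hypothetical helper, and `mtcars` stands in for any data frame with at least one numeric column.

```r
# Hypothetical helper: pairs whose absolute Pearson correlation exceeds a threshold.
strong_pairs <- function(df, cor_threshold) {
  nums <- df[sapply(df, is.numeric)]
  cm <- cor(nums, method = "pearson")
  # The upper triangle lists each pair exactly once; abs() keeps strong negative pairs too
  keep <- upper.tri(cm) & abs(cm) > cor_threshold
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    Pairofvariables = paste(rownames(cm)[idx[, "row"]], colnames(cm)[idx[, "col"]], sep = "-"),
    Pearson_cor_coeff = cm[idx],
    stringsAsFactors = FALSE
  )
}

res <- strong_pairs(mtcars, 0.25)
head(res)
```

Rows come out in column-major order of the matrix rather than in nested-loop order, so sort the result if a particular ordering matters.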
explore(mtcars,c(5,20,50),0.25) #test explore() by using mtcars
## [[1]]
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
##
## [[2]]
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
##
## Pair_of_variables rsquared
## 1 mpg-cyl 0.726180005
## 2 mpg-disp 0.718343340
## 3 mpg-hp 0.602437341
## 4 mpg-drat 0.463995168
## 5 mpg-wt 0.752832794
## 6 mpg-qsec 0.175296320
## 7 mpg-vs 0.440947686
## 8 mpg-am 0.359798943
## 9 mpg-gear 0.230673448
## 10 mpg-carb 0.303518437
## 11 cyl-disp 0.813663302
## 12 cyl-hp 0.692968762
## 13 cyl-drat 0.489913363
## 14 cyl-wt 0.612299668
## 15 cyl-qsec 0.349567190
## 16 cyl-vs 0.657415769
## 17 cyl-am 0.273118125
## 18 cyl-gear 0.242740085
## 19 cyl-carb 0.277716662
## 20 disp-hp 0.625599666
## 21 disp-drat 0.504403822
## 22 disp-wt 0.788508342
## 23 disp-qsec 0.188093852
## 24 disp-vs 0.504690738
## 25 disp-am 0.349549413
## 26 disp-gear 0.308657134
## 27 disp-carb 0.156006724
## 28 hp-drat 0.201384745
## 29 hp-wt 0.433948779
## 30 hp-qsec 0.501580369
## 31 hp-vs 0.522868892
## 32 hp-am 0.059148311
## 33 hp-gear 0.015801561
## 34 hp-carb 0.562218742
## 35 drat-wt 0.507571675
## 36 drat-qsec 0.008318308
## 37 drat-vs 0.193845127
## 38 drat-am 0.507957151
## 39 drat-gear 0.489454337
## 40 drat-carb 0.008242788
## 41 wt-qsec 0.030525638
## 42 wt-vs 0.307931409
## 43 wt-am 0.479549684
## 44 wt-gear 0.340223720
## 45 wt-carb 0.182846838
## 46 qsec-vs 0.554333027
## 47 qsec-am 0.052836016
## 48 qsec-gear 0.045233731
## 49 qsec-carb 0.430663050
## 50 vs-am 0.028340081
## 51 vs-gear 0.042445620
## 52 vs-carb 0.324452295
## 53 am-gear 0.630529315
## 54 am-carb 0.003310202
## 55 gear-carb 0.075115920
## [[1]]
## vs Freq
## 1 0 18
## 2 1 14
##
## [[2]]
## am Freq
## 1 0 19
## 2 1 13
##
## [[1]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.42 19.20 20.09 22.80 33.90
##
## [[2]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 4.000 6.000 6.188 8.000 8.000
##
## [[3]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.1 120.8 196.3 230.7 326.0 472.0
##
## [[4]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.0 96.5 123.0 146.7 180.0 335.0
##
## [[5]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.760 3.080 3.695 3.597 3.920 4.930
##
## [[6]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.513 2.581 3.325 3.217 3.610 5.424
##
## [[7]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 16.89 17.71 17.85 18.90 22.90
##
## [[8]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4375 1.0000 1.0000
##
## [[9]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4062 1.0000 1.0000
##
## [[10]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 4.000 3.688 4.000 5.000
##
## [[11]]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.812 4.000 8.000
##
## Pairofvariables Pearson_cor_coeff
## 1 mpg-drat 0.6811719
## 2 mpg-qsec 0.4186840
## 3 mpg-vs 0.6640389
## 4 mpg-am 0.5998324
## 5 mpg-gear 0.4802848
## 6 cyl-disp 0.9020329
## 7 cyl-hp 0.8324475
## 8 cyl-wt 0.7824958
## 9 cyl-carb 0.5269883
## 10 disp-hp 0.7909486
## 11 disp-wt 0.8879799
## 12 disp-carb 0.3949769
## 13 hp-wt 0.6587479
## 14 hp-carb 0.7498125
## 15 drat-vs 0.4402785
## 16 drat-am 0.7127111
## 17 drat-gear 0.6996101
## 18 wt-carb 0.4276059
## 19 qsec-vs 0.7445354
## 20 am-gear 0.7940588
## 21 gear-carb 0.2740728
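The binary-column check in explore() sums comparisons against 0 and 1, which yields NA as soon as a column contains a missing value. A minimal sketch of a more defensive test follows. `is_binary` is a hypothetical helper, and skipping factors and ignoring NA values are assumptions rather than requirements of the assignment.

```r
# Hypothetical helper: TRUE when a column holds only 0/1 (or TRUE/FALSE) values.
is_binary <- function(x) {
  if (is.factor(x)) return(FALSE)        # factors are treated as categorical instead
  x <- x[!is.na(x)]                      # assumption: missing values are ignored
  length(x) > 0 && all(x %in% c(0, 1))   # TRUE/FALSE match 1/0 under %in%
}

binary_cols <- names(Filter(is_binary, mtcars))
binary_cols
```

For `mtcars` this selects `vs` and `am`, the same two variables that appear in the frequency tables above.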